SYS 6018 | Spring 2021 | University of Virginia


1 Introduction

Tell the reader what this project is about. Motivation.

2 Training Data / EDA

Load data, explore data, etc.

# Load Required Packages
library(tidyverse)
library(pROC)
library(randomForest)
library(GGally)
library(gridExtra)
library(plotly)
haiti <- read_csv("HaitiPixels.csv")
print(dim(haiti))
#> [1] 63241     4
head(haiti)
#> # A tibble: 6 x 4
#>   Class        Red Green  Blue
#>   <chr>      <dbl> <dbl> <dbl>
#> 1 Vegetation    64    67    50
#> 2 Vegetation    64    67    50
#> 3 Vegetation    64    66    49
#> 4 Vegetation    75    82    53
#> 5 Vegetation    74    82    54
#> 6 Vegetation    72    76    52

The data frame contains 4 columns and 63,241 rows. The Class column contains the correct label for each observation. The Red, Green and Blue columns hold the pixel's value in each color channel (NEED TO INCLUDE CORRECT DEFINITION).

2.1 Class Factor

To prepare the data for exploratory data analysis I must make Class a factor.

# Convert Class to a factor and save the result back to haiti
haiti <- haiti %>% 
  mutate(Class = factor(Class))
haiti
#> # A tibble: 63,241 x 4
#>    Class        Red Green  Blue
#>    <fct>      <dbl> <dbl> <dbl>
#>  1 Vegetation    64    67    50
#>  2 Vegetation    64    67    50
#>  3 Vegetation    64    66    49
#>  4 Vegetation    75    82    53
#>  5 Vegetation    74    82    54
#>  6 Vegetation    72    76    52
#>  7 Vegetation    71    72    51
#>  8 Vegetation    69    70    49
#>  9 Vegetation    68    70    49
#> 10 Vegetation    67    70    50
#> # ... with 63,231 more rows
# Count the observations in each class and express them as percentages
haiti %>%
  group_by(Class) %>%
  summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 5 x 3
#>   Class                N  Perc
#> * <fct>            <int> <dbl>
#> 1 Blue Tarp         2022     3
#> 2 Rooftop           9903    16
#> 3 Soil             20566    33
#> 4 Various Non-Tarp  4744     8
#> 5 Vegetation       26006    41

The records are not evenly distributed across the classes. Blue Tarp, the “positive” class if the problem is framed as a binary positive/negative identification, makes up only 3% of the sample. Soil and Vegetation together make up the majority of the sample at 74%.
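As a quick arithmetic check on the percentages quoted above, the class counts from the summary table can be turned into shares directly. This is a minimal sketch in base R; the names `class_counts`, `class_perc`, `soil_veg_perc` and `neg_to_pos` are introduced here for illustration only.

```r
# Class counts copied from the summary table above
class_counts <- c(
  "Blue Tarp"        = 2022,
  "Rooftop"          = 9903,
  "Soil"             = 20566,
  "Various Non-Tarp" = 4744,
  "Vegetation"       = 26006
)

# Share of each class as a whole-number percentage
class_perc <- round(100 * class_counts / sum(class_counts))
class_perc

# Combined share of the two largest classes
soil_veg_perc <- round(
  100 * sum(class_counts[c("Soil", "Vegetation")]) / sum(class_counts)
)
soil_veg_perc

# Number of non-tarp pixels for every Blue Tarp pixel
neg_to_pos <- (sum(class_counts) - class_counts[["Blue Tarp"]]) /
  class_counts[["Blue Tarp"]]
round(neg_to_pos, 1)
```

The last ratio comes out at roughly 30 non-tarp pixels for every Blue Tarp pixel, which is the imbalance any train/test split and threshold choice will have to account for.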

2.2 Binary Class Factor vs. 5 Class Factor

It will be interesting to compare model performance when predicting each of the five classes against performance when predicting only whether a pixel is or is not a Blue Tarp.

2.2.1 Create Binary DataFrame

Create a DataFrame that labels each pixel as either Blue Tarp or not Blue Tarp:
* 0 == Not a Blue Tarp
* 1 == Is a Blue Tarp

haitiBinary <- haiti %>%
  mutate(ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'), ClassBinary = factor(ClassBinary))
haitiBinary %>%
  group_by(ClassBinary) %>%
  summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 2 x 3
#>   ClassBinary     N  Perc
#> * <fct>       <int> <dbl>
#> 1 0           61219    97
#> 2 1            2022     3

2.2.2 How are red, blue and green values distributed between the 5 categories?

redplot <- ggplot(haiti, aes(x=Class, y=Red)) + 
  geom_boxplot(col='red')

greenplot <- ggplot(haiti, aes(x=Class, y=Green)) + 
  geom_boxplot(col='darkgreen')

blueplot <- ggplot(haiti, aes(x=Class, y=Blue)) + 
  geom_boxplot(col='darkblue')

grid.arrange(redplot, greenplot, blueplot)

2.2.3 How are red, blue and green values distributed between the binary categories?

redplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Red)) + 
  geom_boxplot(col='red')

greenplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Green)) + 
  geom_boxplot(col='darkgreen')

blueplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Blue)) + 
  geom_boxplot(col='darkblue')

grid.arrange(redplotB, greenplotB, blueplotB)

Box Plot Comments

In the comments below, “Blue Tarp” is treated as the “positive” result, and all other classes as the “negative” result.

In the box plots of the five classes, it is notable that “Soil” and “Vegetation” are relatively distinct in their non-outlier RGB values. “Rooftop” and “Various Non-Tarp” are more similar to each other in their RGB values.

If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the ranges of red and green are much smaller for blue tarp than non-blue-tarp.

Generally, red values have a larger range for negative results than for positive results, while the medians of the two groups are similar. Green values also have a larger range for negative results, but the positive results have a higher median. Blue values show almost no overlap between blue tarps and non-blue tarps.
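The box-plot reading above can be checked numerically by tabulating the quartiles of each channel within each class. The sketch below uses a small simulated data frame, `toy`, as a stand-in so that it runs on its own; the simulated means and spreads are invented for illustration and are not estimates from HaitiPixels.csv. The helper `channel_quartiles` is also written only for this example. With the real data, `haitiBinary` would take the place of `toy`.

```r
set.seed(6018)

# Hypothetical stand-in for haitiBinary: values are simulated, not real pixels
n_neg <- 300
n_pos <- 30
toy <- data.frame(
  ClassBinary = factor(rep(c("0", "1"), times = c(n_neg, n_pos))),
  Red   = c(rnorm(n_neg, 160, 50), rnorm(n_pos, 170, 20)),
  Green = c(rnorm(n_neg, 150, 45), rnorm(n_pos, 185, 20)),
  Blue  = c(rnorm(n_neg, 120, 40), rnorm(n_pos, 205, 20))
)

# Quartiles of one channel within each class (the numbers a box plot draws)
channel_quartiles <- function(data, channel) {
  t(sapply(split(data[[channel]], data$ClassBinary),
           quantile, probs = c(0.25, 0.5, 0.75)))
}

# One table per channel: rows are classes, columns are the three quartiles
quartiles <- lapply(c(Red = "Red", Green = "Green", Blue = "Blue"),
                    channel_quartiles, data = toy)
quartiles$Blue
```

Comparing the 75% column of class 0 with the 25% column of class 1 in each table gives a quick measure of how much the two boxes overlap.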

2.2.4 View the correlation between Red, Green and Blue

The correlations shown in the plot below make sense because the pixels are of highly saturated colors that are not pure Red, Green or Blue. Few pixels in the data set have low values for R, G and B.

# Pairwise plots of Red, Green and Blue (the first column, Class, is dropped)
ggpairs(haiti[-1], lower = list(continuous = "points", combo = "dot_no_facet"), progress = FALSE)
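The pairwise plot can be complemented with the numeric correlation matrix from base R's `cor()`. This is a minimal sketch that assumes a simulated stand-in, `toy_rgb`, whose channels share a common brightness component; that structure is an assumption made for the example, not a finding from the data. With the real data the call would be `cor(haiti[, c("Red", "Green", "Blue")])`.

```r
set.seed(6018)

# Hypothetical stand-in for the Red/Green/Blue columns: a shared brightness
# term plus channel-specific noise (simulated, not real pixel values)
brightness <- runif(500, min = 50, max = 255)
toy_rgb <- data.frame(
  Red   = brightness + rnorm(500, sd = 15),
  Green = brightness + rnorm(500, sd = 15),
  Blue  = brightness + rnorm(500, sd = 25)
)

# Pearson correlation between each pair of channels
rgb_cor <- cor(toy_rgb)
round(rgb_cor, 2)
```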

3-D Scatterplot

An interactive 3-D scatter plot is a useful way to view how the Red, Green, and Blue values relate to one another across the five classes and across the binary classes.

References:

https://plotly.com/python/3d-scatter-plots/

https://plotly.com/r/figure-labels/

2.2.4.1 Five Categories 3-D Scatterplot

The scatter plot displays each pixel by its Red, Blue and Green values, colored by its class.

# Five-class 3-D scatter plot of the pixels, colored by Class
fiveCat3D <- plot_ly(x = haiti$Red, y = haiti$Blue, z = haiti$Green,
  type = "scatter3d", mode = "markers", color = haiti$Class,
  colors = c('blue2', 'azure4', 'chocolate4', 'coral2', 'chartreuse4'),
  marker = list(symbol = 'circle', sizemode = 'diameter', opacity = 0.35))

# Add the title and label each axis with its color channel
fiveCat3D <- fiveCat3D %>%
  layout(title = "5 Category RGB Plot", scene = list(
    xaxis = list(title = "Red", color = "red"),
    yaxis = list(title = "Blue", color = "blue"),
    zaxis = list(title = "Green", color = "green")))

fiveCat3D

One can see that the pixel classes form discernible groupings by RGB value. Unsurprisingly, the blue tarps have higher blue values, but they span a range of red and green values.

The 3D scatter plot is particularly useful because, by zooming in, one can see that there is a space in the 3D plot with significant mingling of “blue tarp” pixels and other pixel categories. That area of the data will provide a challenge for our model.

# Binary 3-D scatter plot: not Blue Tarp (0) in red, Blue Tarp (1) in blue
binary3D <- plot_ly(x = haitiBinary$Red, y = haitiBinary$Blue, z = haitiBinary$Green,
  type = "scatter3d", mode = "markers", color = haitiBinary$ClassBinary,
  colors = c('red', 'blue2'),
  marker = list(symbol = 'circle', sizemode = 'diameter', opacity = 0.35))

# Add the title and label each axis with its color channel
binary3D <- binary3D %>%
  layout(title = "Binary RGB Plot", scene = list(
    xaxis = list(title = "Red", color = "red"),
    yaxis = list(title = "Blue", color = "blue"),
    zaxis = list(title = "Green", color = "green")))

binary3D

Comments

Similar to the five category 3D scatter plot, the binary scatter plot shows distinct groupings for blue tarp and non-blue-tarp. As expected, there is mingling of blue tarp and non-blue-tarp pixels that will provide a challenge for a model.

3 Model Training

3.1 Set-up

Normalization does not need to be considered because the ranges of Red, Green and Blue are the same.
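Whether the three channels really share a scale can be confirmed by comparing their observed ranges. The sketch below uses only the six rows printed by `head(haiti)` earlier so that it runs on its own; `haiti_head` and `channel_ranges` are names introduced for this example. On the full data the same `sapply()` call would be applied to `haiti[, c("Red", "Green", "Blue")]`.

```r
# First six rows of the data, copied from the head(haiti) output
haiti_head <- data.frame(
  Red   = c(64, 64, 64, 75, 74, 72),
  Green = c(67, 67, 66, 82, 82, 76),
  Blue  = c(50, 50, 49, 53, 54, 52)
)

# Minimum and maximum of each channel, one column per channel
channel_ranges <- sapply(haiti_head, range)
rownames(channel_ranges) <- c("min", "max")
channel_ranges
```

Six rows from a single class cannot settle the question; the point is only to show the check, which should be run on all 63,241 rows before relying on the claim.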

3.1.1 Training and Test Data

The DataFrame must be divided into training and test data sets; however, the data set is unbalanced with few “positive” results, i.e. “Blue Tarp”, compared with negative results.
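One way to handle the imbalance when splitting is to sample within each class, so that the training and test sets keep roughly the same share of Blue Tarp pixels. This is a minimal sketch in base R, assuming an 80/20 split (the split fraction is not specified in this report) and a simulated stand-in `toy` for `haitiBinary`. The helper `stratified_split` is written for this illustration and is not part of any package.

```r
set.seed(6018)

# Hypothetical stand-in for haitiBinary with about 3% positives (simulated)
toy <- data.frame(
  ClassBinary = factor(rep(c("0", "1"), times = c(970, 30))),
  Red   = runif(1000, 0, 255),
  Green = runif(1000, 0, 255),
  Blue  = runif(1000, 0, 255)
)

# Illustrative helper: sample train_frac of the rows within each class
stratified_split <- function(data, class_col, train_frac = 0.8) {
  rows_by_class <- split(seq_len(nrow(data)), data[[class_col]])
  train_rows <- unlist(lapply(rows_by_class, function(rows) {
    # Index into rows so a class with a single row is handled correctly
    rows[sample.int(length(rows), size = floor(train_frac * length(rows)))]
  }), use.names = FALSE)
  list(train = data[train_rows, ], test = data[-train_rows, ])
}

parts <- stratified_split(toy, "ClassBinary")

# Share of each class in the two sets
prop.table(table(parts$train$ClassBinary))
prop.table(table(parts$test$ClassBinary))
```

Sampling within each class guarantees that the rare positive class appears in both sets in the same proportion, which a simple random split of the rows does not.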

3.2 Logistic Regression

3.3 LDA

3.4 QDA

3.5 KNN

3.5.1 Tuning Parameter \(k\)

How were tuning parameter(s) selected? What value is used? Plots/Tables/etc.

3.6 Penalized Logistic Regression (ElasticNet)

3.6.1 Tuning Parameters

NOTE: PART II same as above plus add Random Forest and SVM to Model Training.

3.7 Threshold Selection

4 Results (Cross-Validation)

CV Performance Table Here

5 Conclusions

5.0.1 Conclusion #1

5.0.2 Conclusion #2

5.0.3 Conclusion #3

6 Hold-out Data / EDA

Load data, explore data, etc.

7 Results (Hold-Out)

Hold-Out Performance Table Here

8 Final Conclusions

8.0.1 Conclusion #1

8.0.2 Conclusion #2

8.0.3 Conclusion #3

8.0.4 Conclusion #4

8.0.5 Conclusion #5

8.0.6 Conclusion #6